Kickstarter Exploratory Data Analysis

This is part of Abdulrahman Alkhamees’s project work for the Data Analysis Nanodegree.

Introduction

Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform focused on creativity. The company’s stated mission is to “help bring creative projects to life.” Kickstarter has reportedly received more than $1.9 billion in pledges from 9.4 million backers to fund 257,000 creative projects such as films, music, stage shows, comics, journalism, video games, technology, and food-related projects. Source: https://en.wikipedia.org/wiki/Kickstarter

I’ve been using Kickstarter for a while and I find it a good source of inspiration and for purchasing awesome stuff, especially when it comes to products.

This dataset was obtained from Kaggle: https://www.kaggle.com/kemical/kickstarter-projects/data. It contains around 378,000 projects and 13 variables. The columns are mostly self-explanatory, although you may need to visit the platform to fully understand them.

Univariate Plots Section

Let’s first start with a summary of the data and its data types. It gives us some perspective when asking questions and shows whether we need to make any adjustments before plotting.

##        ID                                  name       
##  Min.   :5.971e+03   New EP/Music Development:    41  
##  1st Qu.:5.383e+08   Canceled (Canceled)     :    13  
##  Median :1.075e+09   Music Video             :    11  
##  Mean   :1.075e+09   N/A (Canceled)          :    11  
##  3rd Qu.:1.610e+09   Cancelled (Canceled)    :    10  
##  Max.   :2.147e+09   Debut Album             :    10  
##                      (Other)                 :378565  
##            category           main_category       currency     
##  Product Design: 22314   Film & Video: 63585   USD    :295365  
##  Documentary   : 16139   Music       : 51918   GBP    : 34132  
##  Music         : 15727   Publishing  : 39874   EUR    : 17405  
##  Tabletop Games: 14180   Games       : 35231   CAD    : 14962  
##  Shorts        : 12357   Technology  : 32569   AUD    :  7950  
##  Video Games   : 11830   Design      : 30070   SEK    :  1788  
##  (Other)       :286114   (Other)     :125414   (Other):  7059  
##        deadline           goal                          launched     
##  2014-08-08:   705   Min.   :        0   1970-01-01 01:00:00:     7  
##  2014-08-10:   558   1st Qu.:     2000   2009-09-15 05:56:28:     2  
##  2014-08-07:   541   Median :     5200   2010-06-30 17:29:43:     2  
##  2015-05-01:   489   Mean   :    49081   2011-02-08 04:29:48:     2  
##  2014-08-09:   477   3rd Qu.:    16000   2011-02-25 09:58:36:     2  
##  2015-07-01:   449   Max.   :100000000   2011-03-03 17:55:38:     2  
##  (Other)   :375442                       (Other)            :378644  
##     pledged                state           backers        
##  Min.   :       0   canceled  : 38779   Min.   :     0.0  
##  1st Qu.:      30   failed    :197719   1st Qu.:     2.0  
##  Median :     620   live      :  2799   Median :    12.0  
##  Mean   :    9683   successful:133956   Mean   :   105.6  
##  3rd Qu.:    4076   suspended :  1846   3rd Qu.:    56.0  
##  Max.   :20338986   undefined :  3562   Max.   :219382.0  
##                                                           
##     country        usd_pledged          p_funded       
##  US     :292627   Min.   :       0   Min.   :       0  
##  GB     : 33672   1st Qu.:      31   1st Qu.:       0  
##  CA     : 14756   Median :     624   Median :      13  
##  AU     :  7839   Mean   :    9059   Mean   :     324  
##  DE     :  4171   3rd Qu.:    4050   3rd Qu.:     107  
##  N,0"   :  3797   Max.   :20338986   Max.   :10427789  
##  (Other): 21799
## 'data.frame':    378661 obs. of  14 variables:
##  $ ID           : int  1000002330 1000003930 1000004038 1000007540 1000011046 1000014025 1000023410 1000030581 1000034518 100004195 ...
##  $ name         : Factor w/ 375767 levels "","    IT\x92S A HOT CAPPUCCINO NIGHT  ",..: 332540 135688 364966 344807 77347 206129 293463 69359 284138 290720 ...
##  $ category     : Factor w/ 159 levels "3D Printing",..: 109 94 94 91 56 124 59 42 114 40 ...
##  $ main_category: Factor w/ 15 levels "Art","Comics",..: 13 7 7 11 7 8 8 8 5 7 ...
##  $ currency     : Factor w/ 14 levels "AUD","CAD","CHF",..: 6 14 14 14 14 14 14 14 14 14 ...
##  $ deadline     : Factor w/ 3164 levels "2009-05-03","2009-05-16",..: 2288 3042 1333 1017 2247 2463 1996 2448 1790 1863 ...
##  $ goal         : num  1000 30000 45000 5000 19500 50000 1000 25000 125000 65000 ...
##  $ launched     : Factor w/ 378089 levels "1970-01-01 01:00:00",..: 243292 361975 80409 46557 235943 278600 187500 274014 139367 153766 ...
##  $ pledged      : num  0 2421 220 1 1283 ...
##  $ state        : Factor w/ 6 levels "canceled","failed",..: 2 2 2 2 1 4 4 2 1 1 ...
##  $ backers      : int  0 15 3 1 14 224 16 40 58 43 ...
##  $ country      : Factor w/ 23 levels "AT","AU","BE",..: 10 23 23 23 23 23 23 23 23 23 ...
##  $ usd_pledged  : num  0 2421 220 1 1283 ...
##  $ p_funded     : num  0 8.07 0.489 0.02 6.579 ...

It looks like the variables mostly have sensible data types. The launched and deadline columns are stored as factors and would need to be parsed as dates, but I am not planning on using them for now. So, let’s get started!
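For reference, if the dates are needed later, parsing them could look roughly like the sketch below. It is a minimal base R example on made-up rows that follow the same text formats as the launched and deadline columns; the data frame `ks_dates` and the new column names are assumptions for illustration, not part of the dataset.

```r
# Made-up rows in the same text format as the dataset's date columns.
ks_dates <- data.frame(
  launched = c("2015-08-11 12:12:28", "2017-09-02 04:43:57"),
  deadline = c("2015-10-09", "2017-11-01"),
  stringsAsFactors = FALSE
)

# Parse the launch timestamp and the deadline date (UTC is an assumption).
ks_dates$launched_at <- as.POSIXct(ks_dates$launched,
                                   format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
ks_dates$deadline_on <- as.Date(ks_dates$deadline)

# Campaign length in whole days, counted from the launch date to the deadline.
ks_dates$duration_days <- as.numeric(
  ks_dates$deadline_on - as.Date(ks_dates$launched_at)
)
print(ks_dates[, c("launched_at", "deadline_on", "duration_days")])
```

The summary above also lists a few launches dated 1970-01-01, which look like placeholder timestamps and would probably need to be filtered out before computing durations.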

ggplot(aes(x=state), data=ks) +
   geom_bar() 

This plot shows the project counts per state. Funding on Kickstarter is all-or-nothing. As we can see here, a large portion of the projects fail.
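To put numbers on the bars, the counts can be turned into percentages. The sketch below reuses the state counts printed in the summary output; the variable names are only illustrative.

```r
# State counts copied from the summary() output above.
state_counts <- c(canceled = 38779, failed = 197719, live = 2799,
                  successful = 133956, suspended = 1846, undefined = 3562)

# Convert the counts to percentages of all projects, rounded to one decimal.
state_share <- round(100 * state_counts / sum(state_counts), 1)
print(sort(state_share, decreasing = TRUE))
```

Roughly 52% of the projects failed and about 35% were successful.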

Next, let’s have a look at the project submission per country.

ggplot(aes(x=country), data=ks) +
   geom_bar()

As expected, the US is leading.
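One thing worth checking here: the summary output lists a country value of `N,0"` for 3,797 projects, which does not look like a real country code. A minimal sketch of how such values could be flagged, assuming valid codes are two upper-case letters (the short vector below is made up apart from that one value):

```r
# A few country values; 'N,0"' appears verbatim in the summary() output above.
country <- c("US", "GB", "CA", "N,0\"", "DE", "N,0\"")

# Flag anything that is not a two-letter upper-case code.
is_valid_code <- grepl("^[A-Z]{2}$", country)
bad_values <- unique(country[!is_valid_code])

print(table(valid = is_valid_code))
print(bad_values)
```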

Next, we can see the main project categories. In this dataset, each project has a main category and a sub-category.

ggplot(aes(x=main_category), data=ks) +
   geom_bar() +
 theme(plot.title=element_text(hjust=0.5), axis.title=element_text(size=10, face="bold"), axis.text.x=element_text(size=10, angle=90))

As can be seen in the graph, Film & Video got the highest number of projects submitted.

Next, we will have a look at the most used currencies.

ggplot(aes(x=currency), data=ks) +
   geom_bar()

Again, USD is the most used currency. I wonder whether you can specify the currency for your project or if it’s enforced by your country or bank account.
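One way to probe that question with the data is a country-by-currency cross-tabulation: if every country maps to exactly one currency, the currency is probably tied to the country. The rows below are made up purely to show the mechanics, so the result says nothing about the real dataset.

```r
# Made-up rows for illustration only; not taken from the dataset.
ks_toy <- data.frame(
  country  = c("US", "US", "GB", "DE", "FR", "CA"),
  currency = c("USD", "USD", "GBP", "EUR", "EUR", "CAD"),
  stringsAsFactors = FALSE
)

# Cross-tabulate and count how many distinct currencies each country uses.
xtab <- table(ks_toy$country, ks_toy$currency)
currencies_per_country <- rowSums(xtab > 0)

print(xtab)
print(currencies_per_country)
```

On the real data, the same two lines would be run on `ks$country` and `ks$currency`.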

Next, I wanted to get an idea of the percentage of its goal that each project raised, to see how far projects fall short of or exceed their goals.

ks %>%
  group_by(state) %>%
  summarize(count=n(), mean=mean(p_funded)) %>%
  arrange(desc(count))
## # A tibble: 6 x 3
##   state       count   mean
##   <fct>       <int>  <dbl>
## 1 failed     197719   9.06
## 2 successful 133956 856   
## 3 canceled    38779 124   
## 4 undefined    3562  57.4 
## 5 live         2799 289   
## 6 suspended    1846 155

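Impressive! Successful projects raise on average 856% of their goals! Keep in mind that this mean is pulled up by a few extreme outliers (the maximum p_funded is above 10,000,000%). In contrast, failed projects raise on average only about 9% of their goal, which makes sense.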
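Because a handful of extreme values can dominate a mean, it may help to look at the median next to it. The sketch below uses made-up p_funded values (one deliberately extreme) to show how far the two can drift apart; `ks_toy` is a hypothetical stand-in for `ks`.

```r
# Made-up funding percentages; the 5000 is a deliberate outlier.
ks_toy <- data.frame(
  state    = c(rep("failed", 4), rep("successful", 4)),
  p_funded = c(0, 5, 10, 25, 101, 110, 120, 5000)
)

# Mean and median of p_funded for each state.
mean_by_state   <- tapply(ks_toy$p_funded, ks_toy$state, mean)
median_by_state <- tapply(ks_toy$p_funded, ks_toy$state, median)

print(cbind(mean = mean_by_state, median = median_by_state))
```

In this toy example the successful group has a mean above 1,300% but a median of only 115%, which is closer to what a typical successful project looks like.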

Univariate Analysis

What is the structure of your dataset?

It has 378,661 rows and 14 variables (the 13 original columns plus p_funded, which I created), stored as factors, integers, and numerics. In general, I would say it’s clean.

What is/are the main feature(s) of interest in your dataset?

  • Number of successful projects vs failed (State)
  • Country popularity (Country)
  • Categories that get funded (Main category)

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Pledges, Goals, and backers are all features that will guide me and explain my feature(s) of interest.

However, there are other features that I wish the dataset had:

  • Ages of the people who back these projects
  • When and for how long the projects get backed (for example, in the first hour or the last hour of the campaign)
  • Number of updates per project
  • Number of comments per project

Did you create any new variables from existing variables in the dataset?

Yes, I created a variable called p_funded, which is the percentage of a project’s goal that gets funded.
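The code that created p_funded is not shown in this report. A minimal sketch, assuming it is simply pledged divided by goal times 100, reproduces the first five values printed in the str() output above:

```r
# First five goal and pledged values as printed in the str() output above.
goal    <- c(1000, 30000, 45000, 5000, 19500)
pledged <- c(0, 2421, 220, 1, 1283)

# Assumed definition: percentage of the goal that was pledged.
p_funded <- pledged / goal * 100
print(round(p_funded, 3))
```

On the full data frame this would be `ks$p_funded <- ks$pledged / ks$goal * 100`, under the same assumption.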

Of the features you investigated, were there any unusual distributions?

Yes. First, I will display the variables that look roughly normal once they are log-scaled, followed by one that is skewed to the right. On the raw scale, the following histograms are dominated by a few extreme values:

# Force R to not use exponential notation 
options(scipen=999)
ggplot(ks, aes(x=goal))  + geom_histogram()

ggplot(ks, aes(x=usd_pledged))  + geom_histogram()

ggplot(ks, aes(x=pledged))  + geom_histogram()

As shown, we cannot see the distributions without scaling the plots. So, let’s log-scale them!

ggplot(ks, aes(x=goal))  + geom_histogram() + scale_x_log10()

ggplot(ks, aes(x=usd_pledged))  + geom_histogram() + scale_x_log10()

ggplot(ks, aes(x=pledged))  + geom_histogram() + scale_x_log10()
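One caveat with log scales: projects that pledged exactly 0 have no finite log10 value, and ggplot2 typically drops such rows from the histogram with a warning. The base R sketch below shows the effect on a few illustrative amounts, along with one common workaround (adding 1 before taking the log); the vector is made up for the example.

```r
# Illustrative pledged amounts, including projects that raised nothing.
pledged <- c(0, 0, 30, 620, 4076, 20338986)

# log10(0) is -Inf, so these rows cannot be placed on a log axis.
log_pledged <- log10(pledged)
n_dropped <- sum(!is.finite(log_pledged))
print(n_dropped)

# Workaround: shift by 1 so that zero maps to log10(1) = 0.
log_pledged_plus1 <- log10(pledged + 1)
print(round(log_pledged_plus1, 2))
```

Whether the shift is appropriate depends on the question; for a quick look at the shape of the distribution it is usually good enough.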

The following one is a little skewed to the right and has a weird gap in the middle.

ggplot(ks, aes(x=backers)) + geom_histogram()

It’s not clear without log scaling. Let’s log scale it!

ggplot(ks, aes(x=backers)) + scale_x_log10() + geom_histogram()

Here we go. I find the shape weird and this data requires further investigation.

Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

No, the data was already clean. It was not altered in any way. I only created an additional variable.

Bivariate Plots Section

In the next plot, we explore the relationship between the goal of the project and the amount that was pledged (the pledged column).

ggplot(aes(x = pledged, y = goal),
       data = ks) + geom_point()

Since there is a lot of data depicted in the above graph, the following things need to be done to make the plot readable:

1. Apply a log transformation to both axes, so that the patterns are more clearly visible
2. Format the x-axis and y-axis labels with commas, so that R does not display exponential notation

ggplot(aes(x = pledged, y = goal),
       data = ks) + geom_point() +
  scale_x_log10(labels=scales::comma) + scale_y_log10(labels=scales::comma)

Since we have a large dataset with over-plotting, the alpha aesthetic will make the points more transparent. Let’s try that.

ggplot(aes(x = pledged, y = goal),
       data = ks) + geom_point(alpha = 1/10) +
  scale_x_log10(labels=scales::comma) + scale_y_log10(labels=scales::comma)

Here we go. This is a very interesting plot indeed! In the next section, we are going to classify those projects based on their state. Next, let’s explore if there’s a relationship between the usd_pledged and the number of project backers.

ggplot(aes(x = (usd_pledged), y = (backers) ),data = ks) + geom_point()

This is yet another interesting plot. I think we have a positive relationship here. Let’s now limit the plot to see the relationship more clearly.

ggplot(aes(x = (usd_pledged), y = (backers) ),data = ks) + geom_point(alpha = 1/5) +
  xlim(0,10000000) + ylim(0,100000)

There you go. It looks to me like it’s a positive, linear relationship. Let’s check that by computing Pearson’s correlation coefficient.

cor(ks$backers, ks$usd_pledged)
## [1] 0.7525394

As expected, a coefficient of about 0.75 indicates a fairly strong positive relationship. Next, we are going to explore the Amount Pledged vs. Project Category.

ggplot(ks, aes(main_category, usd_pledged)) + geom_boxplot() + theme(axis.text.x=element_text(angle=90)) 

Interesting! This plot shows how many outliers our data has. In the final section, we will present the same data on a log scale so that we can understand it better. It is followed later by a five-number summary.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

  • I’ve analyzed the relationship between backers and usd_pledged.
  • I’ve explored the relationship between the goal and pledged.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I did not find anything unexpected.

What was the strongest relationship you found?

The strongest relationship I found was between backers and usd_pledged.
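Since both variables are heavily skewed, Pearson’s coefficient can be influenced by a few very large projects. A rank-based coefficient (Spearman) is a common cross-check, and `cor.test()` adds a p-value. The sketch below uses made-up values to show the calls; it is not a result from the dataset.

```r
# Made-up values with a monotonic but non-linear relationship.
backers     <- c(1, 3, 14, 40, 224)
usd_pledged <- c(1, 220, 1283, 2421, 52375)

# Pearson measures linear association; Spearman works on ranks.
pearson_r  <- cor(backers, usd_pledged)
spearman_r <- cor(backers, usd_pledged, method = "spearman")

# cor.test() reports the Pearson estimate together with a p-value.
pearson_test <- cor.test(backers, usd_pledged)

print(c(pearson = pearson_r, spearman = spearman_r))
print(pearson_test$p.value)
```

On the real data, the same calls would take `ks$backers` and `ks$usd_pledged`.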

Multivariate Plots Section

I am starting this section with a plot of Goal vs Pledged, colored by the state of the project.

ggplot(aes(x = usd_pledged, y = goal, color=state),
       data = ks) + geom_point() 

Okay, it’s nice. But it can be made more readable if a log transformation is done. Let’s try that.

ggplot(aes(x = usd_pledged, y = goal, color=state),
       data = ks) + geom_point() +
  scale_x_log10(labels=scales::comma) + scale_y_log10(labels=scales::comma)

What if we color-code it based on currency?

ggplot(aes(x = usd_pledged, y = backers, color = currency ),data = ks) + geom_point() 

Interesting. Could we see digital currencies soon? Let’s add the main category to the picture.

ggplot(aes(x = usd_pledged, y = goal, color = main_category ),data = ks) + geom_point() 

Interesting. Now, let’s scale it.

ggplot(aes(x = usd_pledged, y = goal, color = main_category ),data = ks) + geom_point() +  scale_x_log10(labels=scales::comma) + scale_y_log10(labels=scales::comma)

Here you go. Looks like technology all over the place! I think it’s not an easy plot to interpret.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Here, we can see that most of the successful projects sit closer to the x-axis, meaning they have lower goals, which is what this platform is perfect for.

Were there any interesting or surprising interactions between features?

Yes, this was the case with backers vs. usd_pledged. When you color-code the data by currency, you see USD take over, which is expected.


Final Plots and Summary

Plot One

ggplot(aes(x=main_category), data=ks) +
  geom_bar() +
  theme(plot.title=element_text(hjust=0.5), axis.title=element_text(size=10, face="bold"), axis.text.x=element_text(size=10, angle=90)) +
  ggtitle("Number of Projects per Category") + xlab("Project category") + ylab("Number of projects")

Here, we see the number of projects published per category. My initial expectation was that most of the projects would fall under the Technology category, but surprisingly, Film & Video has more projects than any other category, followed by Music. In the next plot, we are going to see how that compares with the amount pledged in each category.

Plot Two

# Inspired by this kernel from Kaggle: https://www.kaggle.com/andrewjmah/kickstarter-exploratory-data-analysis-with-r
ggplot(ks, aes(main_category, usd_pledged)) + geom_boxplot() + 
  ggtitle("Amount Pledged vs. Project Category") + xlab("Project Category") + 
  ylab("Amount Pledged (USD)") + 
  theme(axis.text.x=element_text(angle=90)) + 
   scale_y_log10(labels=scales::comma)

First of all, looking at the median, Q1, and Q3 in each category in this plot gives you a clear idea of the amount projects usually raise per category. Generally, the categories are not that far from one another (in terms of the median and quartiles). Second of all, Design, Games, and Technology are the categories that attract the highest pledged amounts. This is very interesting since we saw in the previous plot how “Film & Video” was leading in submissions per category, but the money says something else :)
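The numbers behind a boxplot can also be printed directly. The sketch below computes Tukey’s five-number summary (minimum, lower hinge, median, upper hinge, maximum) per category with base R; the amounts are made up for two categories, and `ks_toy` is a hypothetical stand-in for `ks`.

```r
# Made-up pledged amounts for two categories, for illustration only.
ks_toy <- data.frame(
  main_category = c(rep("Design", 5), rep("Music", 5)),
  usd_pledged   = c(10, 200, 1500, 9000, 40000,
                    0, 50, 600, 3000, 8000)
)

# One column per category, one row per summary statistic.
five_num <- sapply(split(ks_toy$usd_pledged, ks_toy$main_category), fivenum)
rownames(five_num) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
print(five_num)
```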

Plot Three

ggplot(aes(y = goal, x = usd_pledged, color = state ),data = ks) + geom_point()  + scale_x_log10(labels=scales::comma)+scale_y_log10(labels=scales::comma) + xlab("Amount Pledged (USD)") + ylab("Project Goal (USD)") +
  ggtitle("Amount Pledged vs. Project Goal") 

There are a couple of things I find interesting in this graph. Firstly, it shows right away where the successful projects end up: their pledged amounts reach or exceed their goals. Secondly, it helps explain why you see a lot of failed projects high on the y-axis. One of the reasons for failure is clearly a goal that is far above the amounts typically pledged along the x-axis (some of these could be scams as well). Also, the more money you aim for, the less likely you are to get funded. It also suggests that Kickstarter may not be the right place to fund projects with very large goals.
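The claim that higher goals succeed less often could be checked by bucketing goals and computing a success rate per bucket. The sketch below shows one way to do it in base R; the rows and the bucket boundaries are made up, so the printed rates illustrate the method rather than the dataset.

```r
# Made-up goals and outcomes, for illustration only.
ks_toy <- data.frame(
  goal  = c(500, 800, 900, 5000, 7000, 9000, 50000, 60000, 500000, 1000000),
  state = c("successful", "successful", "failed", "successful", "failed",
            "failed", "failed", "successful", "failed", "failed")
)

# Hypothetical bucket boundaries; each bucket includes its upper bound.
ks_toy$goal_bucket <- cut(ks_toy$goal,
                          breaks = c(0, 1000, 10000, 100000, Inf),
                          labels = c("<=1K", "1K-10K", "10K-100K", ">100K"))

# Share of successful projects within each bucket.
success_rate <- tapply(ks_toy$state == "successful", ks_toy$goal_bucket, mean)
print(round(success_rate, 2))
```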

Reflection

In my opinion, the most important thing is working on something you care about and have a genuine interest in. When I scanned the Udacity datasets that were recommended to me, I had no interest in any of the subjects that were presented. This is where, I guess, domain knowledge and interest come in handy when asking questions. I started my dataset search and first chose a dataset from Kiva, which is a crowdsourcing platform for loans. It was painful to download, understand, and clean the data; I spent a couple of days cleaning it and eventually, after cleaning it, I didn’t have enough variables to meet the dataset requirement. After this project, I started to appreciate tidy datasets.

I also didn’t like R, probably because I am more familiar with Python. But I liked the fact that I was exposed to it and was able to interact with its community. Finally, I think we can build models with this dataset to predict project success. That would be my next goal after exploring this dataset. This was fun.